Abstract.

We consider the recovery of an underlying signal $\mathbf{x} \in \mathbb{C}^m$ based on projection measurements of
the form $\mathbf{y} = \mathbf{M}\mathbf{x} + \mathbf{w}$, where $\mathbf{y} \in \mathbb{C}^\ell$ and $\mathbf{w}$ is measurement noise; we are interested in the case
$\ell \ll m$. It is assumed that the signal model $p(\mathbf{x})$ is known, and $\mathbf{w} \sim \mathcal{CN}(\mathbf{w}; \mathbf{0}, \boldsymbol{\Sigma}_w)$, for known $\boldsymbol{\Sigma}_w$.
The objective is to design a projection matrix $\mathbf{M} \in \mathbb{C}^{\ell \times m}$ to maximize key information-theoretic
quantities with operational significance, including the mutual information between the signal and
the projections $\mathcal{I}(\mathbf{x}; \mathbf{y})$ or the Rényi entropy of the projections $h_\alpha(\mathbf{y})$ (Shannon entropy is a special
case). By capitalizing on explicit characterizations of the gradients of the information measures with
respect to the projection matrix, where we also partially extend the well-known results of Palomar
and Verdú from the mutual information to the Rényi entropy domain, we unveil the key operations
carried out by the optimal projection designs: mode exposure and mode alignment. Experiments
are considered for the case of compressive sensing (CS) applied to imagery. In this context, we
provide a demonstration of the performance improvement possible through the application of the
novel projection designs in relation to conventional ones, as well as justification for a fast online
projection design method with which state-of-the-art adaptive CS signal recovery is achieved.
1. Introduction. Compressive sensing (CS) [7], [13] has recently emerged as an
important area of research in image sensing and processing. Compressive sensing
has been particularly successful in multidimensional imaging applications, including
magnetic resonance [29], spectral imaging [17], [45] and video [44], [22]. Conventional
sensing systems typically first acquire data in an uncompressed form (e.g., individual
pixels in an image) and then perform compression subsequently, for storage or
communication. In contrast, CS involves acquisition of the data in an already compressed
form, reducing the quantity of data that need be measured in the first place. In CS
the underlying signal to be measured is projected onto a set of vectors [7], [13], and
one must solve an inverse problem to recover the underlying signal of interest.

There are two hallmarks of the original CS theory. First, the projection vectors
were usually constituted uniformly at random. Second, the underlying signal model
used to regularize the inverse problem was based on the assumption that the underlying
signal could be sparsely represented in terms of an orthonormal basis or frame.
However, even in some of the early CS studies, it was recognized that improved
performance could be achieved with projection vectors designed for the underlying signal
of interest [24], [TJ], [2], rather than using random projections. Further, it has recently
been recognized that a signal model based upon sparsity is often overly primitive,
and model-based CS [3], wherein improved signal models are employed, may yield
improved CS performance (high-quality signal recovery with fewer projection
measurements). Signal models that have been considered include the Gaussian mixture
model (GMM) [11], union-of-subspace models [15], and manifold models [4].

†W. R. Carson and M. R. D. Rodrigues were supported by the Fundação para a Ciência e a
Tecnologia through the research project PTDC/EEA-TEL/100854/2008.

‡M. Chen, R. Calderbank and L. Carin are with the Department of Electrical Engineering, Duke
University, USA. Their work was supported in part by NSF under Grant DMS-0914892, by ONR
under Grant N00014-08-1-1110, and by DARPA under Grant N66001-11-1-4002 as part of the KeCom
Program.

§W. R. Carson is now with PA Consulting Group, Cambridge Technology Centre, Melbourn, UK.

¶M. R. D. Rodrigues is now with Dept. E & EE, University College London, London, U.K.
A part of the material presented in this work has been published in IEEE ICASSP 2012.

W. R. Carson, M. Chen, M.R.D. Rodrigues, R. Calderbank and L. Carin


In this paper our goal is to design CS projection matrices ("measurement kernels")
matched to a general statistical signal model. Specifically, if the underlying signal
to be measured is $\mathbf{x} \in \mathbb{C}^m$, it is assumed that we have access to a general signal
model, represented statistically by the density function $p(\mathbf{x})$. Our objective is to design
the projection matrix to maximize the mutual information between the underlying
signal and the observed compressive measurements, or to maximize the Rényi entropy of the
compressive measurements.

The key to the approach considered in this paper is the realization that the
projection-design problem for CS systems (subject to a power constraint) exhibits
parallels with the precoder design problem for multiple-input-multiple-output (MIMO)
communications systems: in the communications problem a source is being matched
to a channel whereas in CS a channel, or equivalently the noise covariance, is being
matched to the source. This link has also been recognized recently by Schniter [43],
who has provided projection designs for sources modelled by multivariate Gaussian
distributions, as well as by Carson et al. [TU], who have also provided designs for
general multivariate source distributions. With the precoder design problem having
a long tradition in the information theory and communications field, this link also
provides the means to translate, with appropriate modifications, much of the design
know-how and experience from the communications domain to CS.

The traditional problem of precoder design for MIMO Gaussian channels has
drawn on various performance metrics relevant for data communications. Common
precoder design approaches aim to maximize the system signal-to-noise ratio (SNR)
and the system signal-to-interference-plus-noise ratio (SINR) [33], [41] or minimize
the system error probability [5], [18]. Another emerging precoder design approach
imbued with operational significance is based on the maximization of the mutual
information between the input and the output of the system [28], [34], [27], [36], [49]. This
novel design principle has been shown to yield considerable rate gains in a variety of
communications scenarios, due to the fact that, in addition to adapting to the channel
characteristics, the designs also adapt to important features necessary to achieve
high-rate reliable communications: the designs conform to the exact characteristics rather
than only to the second-order statistics of the signaling scheme, as in traditional
approaches (see [33], [41]). The basis of the emergence of the mutual information
based designs has been the fundamental connections between information theory and
estimation theory, which have unveiled the interplay between mutual information and
the minimum mean-squared error (MMSE) in scalar Gaussian channels [19] or mutual
information and the MMSE matrix in vector Gaussian channels [32]. These results
offer a means to bypass the absence of closed-form mutual information expressions
for MIMO Gaussian channels driven by arbitrary (non-Gaussian) signaling schemes.

The operational significance of mutual information, which acts as the rationale
for its use as the basis of a plethora of designs, is well known not only in data
communications - it represents the highest reliable information transmission rate in a
single-user channel driven by a specific signalling scheme - but also in other domains.
For example, in classification problems mutual information relates (through bounds)
to the Bayesian error probability of the classifier [21]; and, in regression problems
mutual information relates (through bounds) to the reconstruction error [38].

We consider design of the measurement kernel based upon maximizing the mutual
information between the underlying signal $\mathbf{x}$ and the compressive measurement $\mathbf{y}$. We
also consider design based upon maximizing the Rényi entropy of $\mathbf{y}$, where the latter
represents a generalization with operational relevance [16]. The projection design will
be implemented in practice using gradient descent, and we demonstrate that for a
GMM signal model the gradient of Rényi entropy with respect to the design matrix
may be expressed analytically, for a special parameter setting. Further, we recover
the gradient of Shannon entropy as a special case of the Rényi result.

The article considers both theoretical results, which disclose key operations
effected by the projection designs, as well as experimental results that demonstrate the
merit of the approach as applied to a practical CS imaging problem. One key
operation relates to the notion of mode alignment in mutual information based designs: the
modes of the source, which depending on the source statistical model are given by the
eigenvectors of the source covariance matrix or the eigenvectors of the source MMSE
matrix, have to align with the modes (eigenvectors) of the noise covariance matrix as
a means to improve performance. This role can also be conceptually appreciated by
viewing the measurement kernel as a sieve that aligns relevant statistical features of
the source to the statistical features of the noise, in order to disclose relevant
information for reconstruction. The relevance of mode alignment, which is typically absent
in communications problems¹, has also been recently unveiled in radar applications
[46]. Bridging the theoretical and practical results is the formal justification
of a low-complexity high-performance online projection strategy, the partial direction
sensing (PDS) method [14], which brings together the main operational features of
the optimal measurement kernel designs, including mode alignment.

The detailed contributions of the article include: 

• Recognition that recent advances in communications, which relate to the 
design of precoders for MIMO communications channels, carry over to CS, 
leading to a communications-inspired kernel design framework for CS appli- 
cations. 

• Proposal of mutual information based offline - where a set of projections is 
optimized simultaneously - and online - where the individual projections are 
optimized sequentially - kernel designs. The article unveils key operations 
carried out by the optimal kernel designs for multivariate Gaussian sources 
and general multivariate sources, including the operations of source and noise 
modes exposure, mode weighting and mode alignment. Particular emphasis 
is given to the role of mode alignment as a means to improve further the 
reconstruction performance in compressive sensing applications. 

• Proposal of Rényi entropy based kernel designs. The article also underlines
some relations between the mutual information (or Shannon entropy) and the
Rényi entropy based kernel constructions.

• Formal rationale for the PDS strategy [14], which is based on the operational
insight unveiled by the theoretical characterizations of the optimal kernel
designs.

• Partial generalizations of the I-MMSE identity from the mutual information
(or Shannon entropy) to the Rényi entropy domain.

• A range of experimental results that illustrate the benefit of the novel
measurement kernel designs in relation to the conventional random ones.

1 This operation is absent in precoder designs for MIMO Gaussian channels driven by Gaussian
inputs, due to the fact that the signal covariance is often taken to be white, but is present in
precoder designs for MIMO Gaussian channels driven by non-Gaussian inputs. The role of a certain
permutation operation in the precoder design is hinted at by Lamarca in [27].

2 Note that Schniter [43] does not recognize the role of mode alignment due to the statistical
assumptions about the source and noise covariances: this operation is not present when the source
covariance is the identity matrix or when the noise covariance is also the identity matrix.

The remainder of the article is organized as follows. In Section 2 we briefly
summarize the notation used throughout. Section 3 reviews the modeling and design
approach, introducing key system assumptions. Section 4 introduces the optimal
kernel design based on the Shannon-based mutual information metric - this builds upon
work in the communications field on precoder design for MIMO channels driven by
Gaussian inputs and arbitrary inputs. Section 5 introduces the optimal kernel design
based on the Rényi entropy metric, taking advantage of the closed-form expressions
available for a GMM source. Section 6 provides the body of evidence that
demonstrates the performance improvement possible through the application of the
projection designs put forth in previous sections. We consider examples based on offline
kernel design, based upon the prior signal model, as well as online kernel design based
upon sequential update of the posterior, all within the context of a GMM signal
representation, which yields analytic CS inversion. Section 7 draws the main conclusions.
The Appendices contain proofs and supporting mathematical derivations.

2. Notation and Definitions. In the following text scalar quantities are
denoted by italics, vectors are denoted by boldface lower case letters and matrices are
denoted by boldface upper case letters. The projection of scalar $x$ onto the
non-negative orthant is denoted $(x)^+ = \max(0, x)$. The superscript $(\cdot)^\star$ is used to denote
an optimal solution and the superscripts $(\cdot)^T$, $(\cdot)^*$ and $(\cdot)^\dagger$ denote transpose,
conjugate and conjugate transpose operators, respectively. The element in the $i$-th row and
$j$-th column of the matrix $\mathbf{X}$ is denoted by $[\mathbf{X}]_{ij}$. The trace of a matrix is denoted
$\operatorname{tr}(\cdot)$. The diagonal matrix with diagonal elements given by either vector $\mathbf{x}$ or the
diagonal elements of matrix $\mathbf{X}$ is denoted by $\operatorname{Diag}(\mathbf{x})$ or $\operatorname{Diag}(\mathbf{X})$, respectively.

We refer frequently to the following special matrices and sets: the $n \times n$ identity
matrix is denoted $\mathbf{I}_n$, the $n \times n$ flipped identity matrix with ones on the anti-diagonal
is denoted $\mathbf{J}_n$, the $m \times n$ matrix of all zeros is denoted $\mathbf{0}_{m \times n}$; the subscripts may
be dropped where no confusion may arise. The set of all $n \times n$ unitary matrices is
denoted $\mathbb{S}^n$ and the set of $m \times n$ complex matrices is denoted $\mathbb{C}^{m \times n}$.

The notation $\mathbf{x} \sim \mathcal{CN}(\mathbf{x}; \boldsymbol{\mu}, \boldsymbol{\Sigma})$ denotes a random variable $\mathbf{x}$ which is circularly
symmetric complex Gaussian distributed with mean $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$.

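3. Modelling and Design Approach. In CS, we aim to reconstruct the signal
of interest $\mathbf{x} \in \mathbb{C}^m$ based on a small number of noisy projections:

$$ \mathbf{y} = \mathbf{M}\mathbf{x} + \mathbf{w}, \qquad \mathbf{y} \in \mathbb{C}^\ell \tag{3.1} $$

with $\ell < m$ and where $\mathbf{M} \in \mathbb{C}^{\ell \times m}$ is the kernel (or projection) matrix and $\mathbf{w}$
represents zero-mean circularly symmetric complex Gaussian noise with positive definite
covariance matrix $\boldsymbol{\Sigma}_w$, i.e., $\mathbf{w} \sim \mathcal{CN}(\mathbf{w}; \mathbf{0}, \boldsymbol{\Sigma}_w)$. The action of the kernel can be
understood in terms of two separate projections and a power allocation (or stretching)
operation, which are associated with the matrices in its singular value decomposition
(SVD) given by:

$$ \mathbf{M} = \mathbf{U}_M \boldsymbol{\Lambda}_M \mathbf{V}_M^\dagger \tag{3.2} $$

where $\boldsymbol{\Lambda}_M = [\operatorname{Diag}(\sqrt{\lambda_{M_1}}, \ldots, \sqrt{\lambda_{M_\ell}}) \;\; \mathbf{0}_{\ell \times (m-\ell)}] \in \mathbb{R}^{\ell \times m}$, $\mathbf{U}_M \in \mathbb{S}^\ell$, $\mathbf{V}_M \in \mathbb{S}^m$,
and $\lambda_{M_1} \ge \lambda_{M_2} \ge \cdots \ge \lambda_{M_\ell} \ge 0$ correspond to the (non-negative) eigenvalues of
$\mathbf{M}\mathbf{M}^\dagger$.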

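Both the signal and the noise covariance matrices are positive (semi-)definite and
can also be represented in terms of their eigenvalue decomposition (projections and
power allocation). In particular, the signal covariance matrix is given by:

$$ \boldsymbol{\Sigma}_x = \mathbf{U}_x \boldsymbol{\Lambda}_x \mathbf{U}_x^\dagger \tag{3.3} $$

where $\mathbf{U}_x \in \mathbb{S}^m$, $\boldsymbol{\Lambda}_x = \operatorname{Diag}(\lambda_{x_1}, \ldots, \lambda_{x_m})$ and $\lambda_{x_1} \ge \lambda_{x_2} \ge \cdots \ge \lambda_{x_m} \ge 0$ are the
(non-negative) eigenvalues of $\boldsymbol{\Sigma}_x$. Similarly, the noise covariance matrix is given by:

$$ \boldsymbol{\Sigma}_w = \mathbf{U}_w \boldsymbol{\Lambda}_w \mathbf{U}_w^\dagger, \tag{3.4} $$

where $\mathbf{U}_w \in \mathbb{S}^\ell$, $\boldsymbol{\Lambda}_w = \operatorname{Diag}(\lambda_{w_1}, \ldots, \lambda_{w_\ell})$ and $0 < \lambda_{w_1} \le \lambda_{w_2} \le \cdots \le \lambda_{w_\ell}$
correspond to the (positive) eigenvalues of $\boldsymbol{\Sigma}_w$.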

Our design approach, which relies not only on a statistical model for the noise
but also on the signal, draws on specific quantitative metrics in order to conceive and
compare various possible kernel designs. A natural metric, which relates to the best
achievable reconstruction error, is the (non-linear) MMSE given by:

$$ \mathrm{MMSE} = \mathbb{E}\left\{ \operatorname{tr}\left[ (\mathbf{x} - \mathbb{E}\{\mathbf{x}|\mathbf{y}\}) (\mathbf{x} - \mathbb{E}\{\mathbf{x}|\mathbf{y}\})^\dagger \right] \right\} \tag{3.5} $$

that involves the use of conditional mean estimation to recover the signal of
interest from the noisy projections, i.e., $\hat{\mathbf{x}}(\mathbf{y}) = \mathbb{E}\{\mathbf{x}|\mathbf{y}\}$ [25]. We, however, capitalize
on information-theoretic metrics, most notably the mutual information and Rényi
entropy, based on the fact that mutual information and Rényi entropy - in view of
recent developments in information theory and communications - appear to be more
amenable to mathematical analysis than the non-linear MMSE. In addition, it is also
possible to bound the MMSE via the mutual information as follows [38]:

$$ \mathrm{MMSE} \ge \frac{1}{2\pi e} \exp\left\{ 2\left[ h(\mathbf{x}) - \mathcal{I}(\mathbf{x}; \mathbf{y}) \right] \right\}, \tag{3.6} $$

where $h(\mathbf{x})$ denotes the differential entropy of $\mathbf{x}$ and $\mathcal{I}(\mathbf{x}; \mathbf{y})$ denotes the mutual
information between $\mathbf{x}$ and $\mathbf{y}$.
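
As a sanity check on the bound in (3.6), the following sketch evaluates both sides for a scalar real Gaussian source observed through $y = m x + w$, where every quantity has a closed form. The function names and the restriction to a real scalar model are assumptions made for illustration (the article works with complex vectors), and entropies are measured in nats. In this special case the bound is expected to hold with equality.

```python
import math

def gaussian_entropy(var):
    """Differential entropy (nats) of a real scalar Gaussian with variance var."""
    return 0.5 * math.log(2.0 * math.pi * math.e * var)

def mutual_information(m, var_x, var_w):
    """I(x; y) in nats for y = m*x + w with real Gaussian x and w."""
    return 0.5 * math.log(1.0 + m * m * var_x / var_w)

def mmse(m, var_x, var_w):
    """Exact MMSE of x given y for the same scalar Gaussian model."""
    return var_x / (1.0 + m * m * var_x / var_w)

def mmse_lower_bound(h_x, mi):
    """Right-hand side of (3.6): exp(2 [h(x) - I(x; y)]) / (2 pi e)."""
    return math.exp(2.0 * (h_x - mi)) / (2.0 * math.pi * math.e)

# Illustrative (hypothetical) parameter values.
m, var_x, var_w = 1.5, 2.0, 0.5
bound = mmse_lower_bound(gaussian_entropy(var_x), mutual_information(m, var_x, var_w))
exact = mmse(m, var_x, var_w)
print(f"MMSE = {exact:.6f}, lower bound = {bound:.6f}")
```

For non-Gaussian sources the two numbers would generally differ, with the bound lying below the MMSE.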

The crux of our design approach, which we also partially extend from the mutual
information to the Rényi entropy metric, is a fundamental result that links the
gradient, with respect to some parameters, of the mutual information between the input
and the output of a linear vector Gaussian channel model to the MMSE matrix
associated with the model; this is known as the I-MMSE relationship. This result, which was
originally put forth for the linear scalar Gaussian model by Guo, Shamai and Verdú
[19] and later for linear vector Gaussian channels by Palomar and Verdú in [32], can
be directly applied to the model in (3.1) so that:

$$ \nabla_{\mathbf{M}} \mathcal{I}(\mathbf{x}; \mathbf{y}) = \boldsymbol{\Sigma}_w^{-1} \mathbf{M} \mathbf{E} \tag{3.7} $$

where the MMSE matrix³

$$ \mathbf{E} = \mathbb{E}\left\{ (\mathbf{x} - \mathbb{E}\{\mathbf{x}|\mathbf{y}\}) (\mathbf{x} - \mathbb{E}\{\mathbf{x}|\mathbf{y}\})^\dagger \right\} \tag{3.8} $$

$$ \phantom{\mathbf{E}} = \mathbf{U}_E \boldsymbol{\Lambda}_E \mathbf{U}_E^\dagger \tag{3.9} $$

where $\mathbf{U}_E \in \mathbb{S}^m$, $\boldsymbol{\Lambda}_E = \operatorname{Diag}(\lambda_{E_1}, \ldots, \lambda_{E_m})$ and $\lambda_{E_1} \ge \lambda_{E_2} \ge \cdots \ge \lambda_{E_m} \ge 0$ are the
(non-negative) eigenvalues of the MMSE matrix.

3 Note that the MMSE matrix $\mathbf{E}$ is a function of the kernel $\mathbf{M}$.
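
The relationship in (3.7) can be checked numerically in the simplest setting. The sketch below assumes a scalar circularly symmetric complex Gaussian source, for which the mutual information and the MMSE are available in closed form, and interprets the complex gradient as the derivative with respect to the conjugate of $m$ (an assumption about the convention). All function names are illustrative.

```python
import math

def mutual_information(m, var_x, var_w):
    """I(x; y) in nats for y = m*x + w, circular complex Gaussian x and w."""
    return math.log(1.0 + abs(m) ** 2 * var_x / var_w)

def mmse(m, var_x, var_w):
    """Scalar MMSE 'matrix' E for the same model."""
    return var_x / (1.0 + abs(m) ** 2 * var_x / var_w)

def complex_gradient(f, m, step=1e-6):
    """Central-difference estimate of df/d(conj m) = (df/da + 1j*df/db)/2, m = a + 1j*b."""
    d_re = (f(m + step) - f(m - step)) / (2.0 * step)
    d_im = (f(m + 1j * step) - f(m - 1j * step)) / (2.0 * step)
    return 0.5 * (d_re + 1j * d_im)

# Illustrative (hypothetical) parameter values.
m, var_x, var_w = 0.8 - 0.3j, 2.0, 0.5
numeric = complex_gradient(lambda z: mutual_information(z, var_x, var_w), m)
closed_form = m * mmse(m, var_x, var_w) / var_w  # scalar version of Sigma_w^{-1} M E
print(numeric, closed_form)
```

The two printed values should agree up to finite-difference error; for non-Gaussian sources the MMSE would have to be evaluated numerically instead.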

Our design approach also draws on a specific kernel design constraint. It is
important to recall that in CS applications the kernel design is typically set to obey
unit-norm row constraints or, instead, orthonormal constraints [30]. In contrast, in
communications applications the kernel (or precoder) design obeys a power (trace)
constraint, which states that on average the rows have unit norm. A paper that does
consider this power constraint for CS is the work by Schniter [43]; however, this is
unusual in the CS field. We adopt this more general constraint, which, in addition
to leading to solutions with higher mutual information or Rényi entropy, enables the
formulation of the design framework, which the unit-norm rows constraint does not.
The exception to this is the special case of adaptive online design in Section llOl where 
each row of the kernel is designed sequentially such that the two constraints coincide. 
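
To make the difference between the two constraints concrete, the following sketch rescales an arbitrary example kernel in both ways: once with a single common factor so that the average row power $\frac{1}{\ell}\operatorname{tr}(\mathbf{M}\mathbf{M}^\dagger)$ equals one, and once row by row to unit norm. The helper names and the example matrix are hypothetical.

```python
import math

def average_row_power(M):
    """(1/l) * tr(M M^dagger): mean squared Euclidean norm of the rows of M."""
    return sum(abs(v) ** 2 for row in M for v in row) / len(M)

def scale_to_power_constraint(M):
    """Rescale M by one common factor so that (1/l) tr(M M^dagger) = 1."""
    g = 1.0 / math.sqrt(average_row_power(M))
    return [[g * v for v in row] for row in M]

def normalize_rows(M):
    """Rescale every row separately to unit norm (the usual CS constraint)."""
    out = []
    for row in M:
        norm = math.sqrt(sum(abs(v) ** 2 for v in row))
        out.append([v / norm for v in row])
    return out

M = [[3.0, 0.0, 4.0], [0.0, 1.0, 0.0]]   # arbitrary 2 x 3 example kernel
P = scale_to_power_constraint(M)
R = normalize_rows(M)
row_norms_P = [math.sqrt(sum(abs(v) ** 2 for v in row)) for row in P]
print(average_row_power(P), row_norms_P)   # trace constraint met, row norms unequal
print(average_row_power(R))                # unit-norm rows also meet the trace constraint
```

Unit-norm rows always satisfy the trace constraint, whereas the trace constraint leaves the kernel free to distribute power unevenly across rows; with a single row the two coincide.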

4. Mutual Information based Kernel Design. In this section we consider
the characterization of the kernel that maximises the mutual information of the model
in (3.1), subject to a power constraint, for multivariate Gaussian sources and general
multivariate sources. The optimal kernel design for multivariate Gaussian sources also
provides a rationale for other kernel designs in subsequent sections, most notably, the
PDS method (we extend the work of [14]). The design problem can be posed abstractly
as follows:

$$ \begin{aligned} &\underset{\mathbf{M}}{\text{maximize}} && \mathcal{I}(\mathbf{x}; \mathbf{M}\mathbf{x} + \mathbf{w}) \\ &\text{subject to} && \tfrac{1}{\ell} \operatorname{tr}(\mathbf{M}\mathbf{M}^\dagger) \le 1 \end{aligned} \tag{4.1} $$

It is important to remark that this optimization problem is non-convex in general.
The use of the fundamental result in (3.7), in addition to enabling the full or partial
characterization of the solution, also leads to efficient computational procedures. We
restate next the characterizations of the optimal kernel designs for Gaussian sources
(Theorem 4.1) and general sources (Theorem 4.2), which also appear in slightly
different forms in [36], [35], [27], in a manner that emphasizes the operational significance
for CS applications.

4.1. Multivariate Gaussian Input Source. The characterization of the
optimal kernel design for a multivariate complex-valued Gaussian source leverages the
well-known closed-form mutual information expression given by:

$$ \mathcal{I}(\mathbf{x}; \mathbf{y}) = \log\det\left( \mathbf{I}_m + \mathbf{M}^\dagger \boldsymbol{\Sigma}_w^{-1} \mathbf{M} \boldsymbol{\Sigma}_x \right). \tag{4.2} $$

This simple closed-form expression allows the use of simple matrix identities, rather
than the gradient result in (3.7), to obtain the solution to (4.1). The case when
$\boldsymbol{\Sigma}_x = \mathbf{I}$ is well-known from communications theory and was recently applied in the
design of measurement kernels by Schniter [43]. However, the case for general source
covariance matrices has not been studied in the communications domain. We unveil
that this leads to the novel operation of mode alignment⁴.

Theorem 4.1. The kernel matrix that solves the optimization problem in (4.1)
for a multivariate complex-valued Gaussian source with covariance matrix $\boldsymbol{\Sigma}_x$ is given
by:

$$ \mathbf{M}^\star = \mathbf{U}_w \boldsymbol{\Lambda}_M^\star \mathbf{U}_x^\dagger \tag{4.3} $$

where $\boldsymbol{\Lambda}_M^\star = [\operatorname{Diag}(\sqrt{\lambda_{M_1}^\star}, \ldots, \sqrt{\lambda_{M_\ell}^\star}) \;\; \mathbf{0}_{\ell \times (m-\ell)}]$, $\lambda_{M_i}^\star = \left( \eta - \lambda_{w_i}/\lambda_{x_i} \right)^+$ with the
noise covariance eigenvalues $\lambda_{w_i}$ arranged in ascending order and the source
covariance eigenvalues $\lambda_{x_i}$ arranged in descending order, and $\eta$ ensures the average
unit-norm row constraint, i.e., $\frac{1}{\ell} \operatorname{tr}(\mathbf{M}^\star \mathbf{M}^{\star\dagger}) = 1$.

Proof. See Appendix C. □

Fig. 4.1. Diagrammatic view of the actions of the optimal kernel design: (a) exposing modes; (b) aligning and ordering modes; (c) waterfilling power allocation.

4 This result was also recently shown in radar [46].


Theorem 4.1 uncovers the operations of the optimal kernel design. In particular, it
is possible to recognize a novel mode alignment operation which involves two aspects:
i) exposing the modes of the noise and source covariance; and ii) ordering (or aligning)
the modes.

First, the left-singular vectors of the kernel are chosen to align with the
eigenvectors of the noise covariance matrix and the right-singular vectors of the kernel are
chosen to align with the eigenvectors of the signal covariance matrix (Fig. 4.1(a)).
This is referred to as exposing the modes.

The ordering (or alignment) of the exposed modes is very particular: the largest
source eigenvalue is matched to the smallest noise eigenvalue, the second largest source
eigenvalue is matched to the second smallest noise eigenvalue, and so on (Fig. 4.1(b)).

Finally, the kernel "weights" the matched modes according to a "waterfilling"
interpretation [12] (Fig. 4.1(c)). Intuitively, this emphasizes the less noisy "channels"
and reduces the influence of the noisier ones as a means to maximize further mutual
information.

As an example, Fig. 4.2 depicts the mutual information associated with two
possible alignments for the signal and noise eigenvalues in a scenario where both
covariance matrices are diagonal, $\boldsymbol{\Lambda}_x = \operatorname{Diag}(1, 0.25)$ and $\boldsymbol{\Lambda}_w = \operatorname{Diag}(1, 0.25)$. It
is evident that the ordering of the modes has a significant impact on the mutual
information at low and medium SNR - the highest mutual information corresponds
to the kernel design that aligns the strongest source eigenvalue with the weakest noise
eigenvalue, $\mathbf{U}_M = \mathbf{U}_w = \mathbf{J}_2$ and $\mathbf{V}_M = \mathbf{I}_2$.

Fig. 4.2. Mutual information as a function of SNR (dB), for two different alignments both with
optimal power allocation.

5 Note that the superscript $\star$ denotes an optimal solution.
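
The effect of the alignment can be reproduced with a few lines of code. The sketch below assumes diagonal covariances, pairs source mode $i$ with noise mode $i$, and allocates power by standard waterfilling, $\lambda_{M_i} = (\eta - \lambda_{w_i}/\lambda_{x_i})^+$. The total power budget is used here as a stand-in for the SNR axis of Fig. 4.2, which is an assumption; the function names and the budget values are illustrative.

```python
import math

def waterfill(gains, total_power):
    """Maximise sum(log(1 + p_i * g_i)) subject to sum(p_i) = total_power, p_i >= 0.

    Returns p_i = max(0, eta - 1/g_i), with the water level eta found by
    dropping the weakest mode until every remaining allocation is positive.
    """
    order = sorted(range(len(gains)), key=lambda i: gains[i], reverse=True)
    active = list(order)
    while active:
        eta = (total_power + sum(1.0 / gains[i] for i in active)) / len(active)
        if eta - 1.0 / gains[active[-1]] > 0.0:
            break
        active.pop()          # weakest active mode gets no power
    powers = [0.0] * len(gains)
    for i in active:
        powers[i] = eta - 1.0 / gains[i]
    return powers

def mutual_information(lam_x, lam_w, total_power):
    """I(x; y) in nats when source mode i is paired with noise mode i."""
    gains = [lx / lw for lx, lw in zip(lam_x, lam_w)]
    powers = waterfill(gains, total_power)
    return sum(math.log(1.0 + p * g) for p, g in zip(powers, gains))

lam_x = [1.0, 0.25]                      # source eigenvalues, descending
for power in (0.5, 2.0, 8.0, 32.0):      # hypothetical power budgets
    aligned = mutual_information(lam_x, [0.25, 1.0], power)    # strong source on weak noise
    reversed_ = mutual_information(lam_x, [1.0, 0.25], power)  # strong source on strong noise
    print(f"power={power:5.1f}  aligned={aligned:.3f}  reversed={reversed_:.3f}")
```

In this toy setting the aligned pairing gives the larger mutual information at every budget, with the relative advantage shrinking as the budget grows.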

4.2. General Multivariate Input Source. While the application of
communications theory results for Gaussian distributions is known to varying degrees
outside the field of communications theory, the results for general sources have not been
fully leveraged outside of communications. The characterization of the optimal kernel
design for a general multivariate complex-valued source, in view of the absence of
closed-form mutual information expressions, now leverages the fundamental result in
(3.7).


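Theorem 4.2. The kernel matrix that solves the optimization problem in (4.1)
for a general multivariate complex-valued source with covariance matrix $\boldsymbol{\Sigma}_x$ is given
by:

$$ \mathbf{M}^\star = \mathbf{U}_w \boldsymbol{\Lambda}_M^\star \mathbf{V}_M^{\star\dagger} \tag{4.4} $$

where $\mathbf{V}_M^\star = \mathbf{U}_E \boldsymbol{\Pi}^\star$, the matrix $\boldsymbol{\Pi}^\star$ is the optimal permutation matrix, $\boldsymbol{\Lambda}_M^\star =
[\operatorname{Diag}(\sqrt{\lambda_{M_1}^\star}, \ldots, \sqrt{\lambda_{M_\ell}^\star}) \;\; \mathbf{0}]$, and $\lambda_{M_i}^\star$ are given by the generalized mercury
waterfilling solution, i.e.,

$$ \lambda_{M_i}^\star = \begin{cases} 0, & \eta \lambda_{w_i} > \mathrm{mmse}_i\left( \mathbf{U}_E \boldsymbol{\Pi}^\star, \boldsymbol{\Lambda}_Q^\star \big|_{\lambda_{M_i} = 0} \right) \\ \mathrm{mmse}_i^{-1}\left( \eta \lambda_{w_i} \right), & \text{otherwise} \end{cases} \tag{4.5} $$

where $\eta$ ensures the average unit-norm row constraint, i.e., $\frac{1}{\ell} \operatorname{tr}(\mathbf{M}^\star \mathbf{M}^{\star\dagger}) = 1$, $\boldsymbol{\Lambda}_Q =
\boldsymbol{\Lambda}_M^\dagger \boldsymbol{\Lambda}_M$, $\boldsymbol{\Lambda}_Q^\star \big|_{\lambda_{M_i} = 0} = \operatorname{Diag}(\lambda_{M_1}^\star, \ldots, \lambda_{M_{i-1}}^\star, 0, \lambda_{M_{i+1}}^\star, \ldots, \lambda_{M_\ell}^\star)$ and $\mathrm{mmse}_i(\mathbf{U}_E \boldsymbol{\Pi}^\star, \boldsymbol{\Lambda}_Q^\star)$
denotes the $i$-th diagonal entry of the MMSE matrix associated with the estimate of
$\mathbf{x}' = \boldsymbol{\Pi}^{\star\dagger} \mathbf{U}_E^\dagger \mathbf{x}$ from

$$ \mathbf{y}' = \boldsymbol{\Lambda}_w^{-1/2} \boldsymbol{\Lambda}_M \mathbf{x}' + \mathbf{n}, \tag{4.6} $$

where $\mathbf{n}$ is zero-mean circularly symmetric complex Gaussian noise with identity
covariance, $\mathbf{n} \sim \mathcal{CN}(\mathbf{n}; \mathbf{0}, \mathbf{I})$. Note that $\mathrm{mmse}_i^{-1}$ is the inverse of $\mathrm{mmse}_i$ with respect to
$\lambda_{M_i}$ for fixed $\lambda_{M_j}$, $\forall j \ne i$.

Proof. See Appendix D. □

6 Note that the MMSE matrix $\mathbf{E}$ is a function of the kernel $\mathbf{M}$.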

Remark 1. In the high noise/low signal power regime, a first-order expansion of
the mutual information is given by [32]:

$$ \mathcal{I}(\mathbf{x}; \mathbf{y}) = \operatorname{tr}\left( \boldsymbol{\Sigma}_w^{-1} \mathbf{M} \boldsymbol{\Sigma}_x \mathbf{M}^\dagger \right) + o\left( \|\boldsymbol{\Sigma}_x\| \right) \tag{4.7} $$

which implies the result observed by Shannon that at low signal-to-noise ratios proper
complex discrete inputs incur a negligible performance loss with respect to the
capacity achieved by Gaussian inputs; hence the results for the Gaussian case in Theorem
4.1 also apply in general for proper complex sources in the high noise/low power
regime.



Theorem 4.2 suggests that the mode alignment is no longer between the
eigenvectors of the source covariance and the eigenvectors of the noise covariance, but
between the eigenvectors of the MMSE matrix and the eigenvectors of the noise
covariance. The diagonalization of the MMSE matrix was first noted for
communications by Lamarca [27] for identity source covariances, and the same holds true for CS
for general source covariances. The singular values of the kernel are described by the
mercury waterfilling algorithm [28], [36], which differs from waterfilling by adjusting for
the non-Gaussian nature of the inputs; the procedure is otherwise remarkably similar.

It is important to emphasize that Theorem 4.1 characterizes fully the optimal
kernel design but - in view of the non-convexity of the problem - Theorem 4.2
characterizes the optimal kernel only partially, via a fixed point equation, since the MMSE matrix $\mathbf{E}$ is still a
function of $\mathbf{M}$. The characterization is useful because it leads to i) stopping criteria
for gradient descent algorithms via (3.7); and ii) alternative optimization algorithms.
Note that if we implement gradient descent with (3.7) we may get trapped in local
maxima since it is known that the mutual information is not always a concave
function of $\mathbf{M}$ [34]. However, mutual information is known to be concave in the squared
singular values of $\mathbf{M}$, for $\mathbf{U}_M = \mathbf{U}_w$ and fixed $\mathbf{V}_M$. An alternative gradient
descent algorithm that leads to the global maximum by avoiding local maxima switches
between optimizing the singular values and the right-singular vectors of the kernel.

5. Design with Rényi Entropy. We consider the characterization of the kernel
that maximizes the output Rényi entropy of the model in (3.1), subject to a power
constraint, for multivariate Gaussian sources and multivariate Gaussian mixture sources.
The design problem can then be posed as follows:

$$ \begin{aligned} &\underset{\mathbf{M}}{\text{maximize}} && h_\alpha(\mathbf{M}\mathbf{x} + \mathbf{w}) \\ &\text{subject to} && \tfrac{1}{\ell} \operatorname{tr}(\mathbf{M}\mathbf{M}^\dagger) \le 1 \end{aligned} \tag{5.1} $$

where

$$ h_\alpha(\mathbf{y}) = \frac{1}{1 - \alpha} \log \int p^\alpha(\mathbf{y}) \, d\mathbf{y}. \tag{5.2} $$

Note that Rényi entropy represents a generalization of Shannon entropy given by:

$$ h_S(\mathbf{y}) = -\int p(\mathbf{y}) \log p(\mathbf{y}) \, d\mathbf{y} \tag{5.3} $$

which is recovered from (5.2) in the limit $\alpha \to 1$.

5.1. Multivariate Gaussian Input Source. For multivariate Gaussian sources,
both Shannon entropy and Rényi entropy can be expressed analytically for all values
of $\alpha > 0$. In particular, the two are shown to be related in the following theorem⁷:

Theorem 5.1.
For a multivariate Gaussian input source where $\mathbf{x} \sim \mathcal{CN}(\mathbf{x}; \boldsymbol{\mu}, \boldsymbol{\Sigma}_x)$, the Rényi
entropy of order $\alpha > 0$ and the Shannon entropy associated with the output of the
model in (3.1) are related as:

$$ h_\alpha(\mathbf{y}) = h_S(\mathbf{y}) - \ell \left( 1 - \frac{\log \alpha}{\alpha - 1} \right), \tag{5.4} $$

where $h_S(\mathbf{y}) = \log\left[ (2\pi e)^\ell \det\left( \boldsymbol{\Sigma}_w + \mathbf{M} \boldsymbol{\Sigma}_x \mathbf{M}^\dagger \right) \right]$.

Proof. See Appendix E. □
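
The gap between the two entropies in (5.4) can be checked numerically in one dimension. The sketch below assumes a scalar circularly symmetric complex Gaussian output with a hypothetical variance, integrates (5.2) and (5.3) over the complex plane with a simple midpoint rule, and compares the difference $h_S - h_\alpha$ with $\ell \left( 1 - \frac{\log \alpha}{\alpha - 1} \right)$ for $\ell = 1$. Only the difference is compared, so the additive constant in $h_S$ plays no role; the function names are illustrative.

```python
import math

def density(u, var):
    """Circular complex Gaussian density as a function of u = |y|^2."""
    return math.exp(-u / var) / (math.pi * var)

def integrate_plane(f, var, steps=50_000):
    """Midpoint rule for the integral of f(|y|^2) over the complex plane.

    In polar coordinates dy = pi * du with u = |y|^2; the tail beyond
    u = 60 * var is negligible for the integrands used here.
    """
    upper = 60.0 * var
    du = upper / steps
    return math.pi * du * sum(f((k + 0.5) * du) for k in range(steps))

def renyi_entropy(alpha, var):
    """Numerical evaluation of (5.2) for the scalar model."""
    return math.log(integrate_plane(lambda u: density(u, var) ** alpha, var)) / (1.0 - alpha)

def shannon_entropy(var):
    """Numerical evaluation of (5.3) for the scalar model."""
    return -integrate_plane(lambda u: density(u, var) * math.log(density(u, var)), var)

def predicted_gap(alpha, dim=1):
    """h_S - h_alpha according to (5.4): dim * (1 - log(alpha) / (alpha - 1))."""
    return dim * (1.0 - math.log(alpha) / (alpha - 1.0))

var = 0.7   # hypothetical output variance
for alpha in (0.5, 2.0, 3.0):
    gap = shannon_entropy(var) - renyi_entropy(alpha, var)
    print(f"alpha={alpha}: numeric gap={gap:.6f}, (5.4) gap={predicted_gap(alpha):.6f}")
```

Because the gap does not depend on the kernel, maximizing either entropy over $\mathbf{M}$ leads to the same design for Gaussian sources.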



Theorem 5.1 leads immediately to a generalization of the I-MMSE identity in
(3.7) for Gaussian sources:

Theorem 5.2. For Gaussian sources, the (complex) gradient with respect to the
kernel of the output Rényi entropy of order $\alpha > 0$ associated with the model in (3.1)
obeys the relationship:

$$ \nabla_{\mathbf{M}} h_\alpha(\mathbf{y}) = \boldsymbol{\Sigma}_w^{-1} \mathbf{M} \mathbf{E}. \tag{5.5} $$

Theorem 5.2 unveils that the relationship between mutual information and the
MMSE matrix in (3.7) also holds for all values of $\alpha > 0$ for the output Rényi entropy
associated with the model in (3.1) for Gaussian sources. Theorem 5.2 also implies that
the kernel design that maximizes the Rényi entropy subject to a power constraint also
obeys the characterization in Theorem 4.1.

7 For quadratic Rényi entropy this result was also given in Appendix A of [42].

5.2. Multivariate Gaussian Mixture Model Input Source. For Gaussian
Mixture Models (GMM) the signal $\mathbf{x} \in \mathbb{C}^m$ is represented by:

$$ p(\mathbf{x}) = \sum_{i=1}^{N} p(i) \, \mathcal{CN}(\mathbf{x}; \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i), \tag{5.6} $$

where $p(i)$ is the probability of occurrence of mixture component $i$, and $\boldsymbol{\mu}_i$ and $\boldsymbol{\Sigma}_i$
correspond to the mean and covariance matrix of the $i$-th circularly symmetric complex
Gaussian distribution. Neither the Shannon entropy, the mutual information nor the
MMSE matrix is known to have a closed-form expression for GMMs. Rényi entropy
and its gradient, however, admit closed-form expressions in some instances, which lend
themselves more easily to optimization via gradient descent algorithms. For example,
the quadratic Rényi entropy of the noisy projection $\mathbf{y}$ in (3.1) is given by:

$$ h_2(\mathbf{y}) = -\log \sum_{i=1}^{N} \sum_{j=1}^{N} p(i) \, p(j) \, \mathcal{CN}(\mathbf{0}; \boldsymbol{\mu}_{ij}, \boldsymbol{\Sigma}_{ij}), \tag{5.7} $$

where:

$$ \boldsymbol{\mu}_{ij} = \mathbf{M} (\boldsymbol{\mu}_i - \boldsymbol{\mu}_j) \tag{5.8} $$

$$ \boldsymbol{\Sigma}_{ij} = \mathbf{M} (\boldsymbol{\Sigma}_i + \boldsymbol{\Sigma}_j) \mathbf{M}^\dagger + 2 \boldsymbol{\Sigma}_w \tag{5.9} $$
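
The closed form in (5.7)-(5.9) can be cross-checked against a brute-force integral. The sketch below is a minimal illustration under stated assumptions: it uses the real scalar analogue of the model (real Gaussians, scalar $m$, as in the real-valued imaging experiments), a toy two-component GMM with hypothetical parameters, and a midpoint rule for the integral of $p^2(y)$. All names are illustrative.

```python
import math

def normal_pdf(value, mean, var):
    """Real scalar Gaussian density N(value; mean, var)."""
    return math.exp(-(value - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def quadratic_renyi_closed_form(m, weights, means, variances, var_w):
    """h_2(y) from (5.7)-(5.9), specialised to a real scalar model y = m*x + w."""
    total = 0.0
    for p_i, mu_i, s_i in zip(weights, means, variances):
        for p_j, mu_j, s_j in zip(weights, means, variances):
            mu_ij = m * (mu_i - mu_j)                     # scalar form of (5.8)
            sigma_ij = m * m * (s_i + s_j) + 2.0 * var_w  # scalar form of (5.9)
            total += p_i * p_j * normal_pdf(0.0, mu_ij, sigma_ij)
    return -math.log(total)

def quadratic_renyi_numeric(m, weights, means, variances, var_w, steps=40_000):
    """Brute-force -log of the integral of p(y)^2, used only as a cross-check."""
    spread = 10.0 * math.sqrt(m * m * max(variances) + var_w)
    lo = min(m * mu for mu in means) - spread
    hi = max(m * mu for mu in means) + spread
    dy = (hi - lo) / steps
    acc = 0.0
    for k in range(steps):
        y = lo + (k + 0.5) * dy
        p_y = sum(p * normal_pdf(y, m * mu, m * m * s + var_w)
                  for p, mu, s in zip(weights, means, variances))
        acc += p_y * p_y * dy
    return -math.log(acc)

weights, means, variances = [0.3, 0.7], [-1.0, 2.0], [0.5, 1.5]   # toy 2-component GMM
args = (0.8, weights, means, variances, 0.2)
print(quadratic_renyi_closed_form(*args), quadratic_renyi_numeric(*args))
```

The closed form costs $N^2$ Gaussian evaluations regardless of the signal dimension, which is what makes gradient-based kernel optimization practical for GMM sources.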

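The complex gradient with respect to $\mathbf{M}$ of the quadratic Rényi entropy of the
noisy projection $\mathbf{y}$ in (3.1) for the GMM is given by:

$$ \nabla_{\mathbf{M}} h_2(\mathbf{y}) = -\frac{\sum_{i=1}^{N} \sum_{j=1}^{N} p(i) \, p(j) \, \mathcal{CN}(\mathbf{0}; \boldsymbol{\mu}_{ij}, \boldsymbol{\Sigma}_{ij}) \, \nabla_{\mathbf{M}} \log \mathcal{CN}(\mathbf{0}; \boldsymbol{\mu}_{ij}, \boldsymbol{\Sigma}_{ij})}{\sum_{i=1}^{N} \sum_{j=1}^{N} p(i) \, p(j) \, \mathcal{CN}(\mathbf{0}; \boldsymbol{\mu}_{ij}, \boldsymbol{\Sigma}_{ij})} \tag{5.10} $$

where:

$$ \nabla_{\mathbf{M}} \log \mathcal{CN}(\mathbf{0}; \boldsymbol{\mu}_{ij}, \boldsymbol{\Sigma}_{ij}) = -\boldsymbol{\Sigma}_{ij}^{-1} \mathbf{M} (\boldsymbol{\Sigma}_i + \boldsymbol{\Sigma}_j) + \boldsymbol{\Sigma}_{ij}^{-1} \mathbf{M} (\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)^\dagger \times \left\{ \mathbf{M}^\dagger \boldsymbol{\Sigma}_{ij}^{-1} \mathbf{M} (\boldsymbol{\Sigma}_i + \boldsymbol{\Sigma}_j) - \mathbf{I} \right\}, \tag{5.11} $$

where $\times$ denotes a matrix multiplication. The proof is given in Appendix F.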

It is interesting to note that the now celebrated I-MMSE relationship in the
information theory literature also applies for Rényi entropy of order $\alpha > 0$ associated
with Gaussian source models. However, this relationship does not seem to carry over
to the Rényi entropy of more general source models. In fact, it can only be shown
that for a general source, which obeys some additional smoothness conditions, the
gradient can be expressed as follows (the proof is a modification of the result in [35]):

$$ \nabla_{\mathbf{M}} h_\alpha(\mathbf{y}) = \beta \, \boldsymbol{\Sigma}_w^{-1} \int \tilde{p}(\mathbf{y}) \left( \mathbf{y} - \mathbf{M} \hat{\mathbf{x}}_{\mathbf{y}} \right) \hat{\mathbf{x}}_{\mathbf{y}}^\dagger \, d\mathbf{y} \tag{5.12} $$

where the probability distribution $\tilde{p}(\mathbf{y}) = p^\alpha(\mathbf{y}) / \int p^\alpha(\mathbf{y}) \, d\mathbf{y}$ and $\hat{\mathbf{x}}_{\mathbf{y}}$ is the conditional mean
estimator.

It is not difficult to appreciate that the right-hand side of (5.12) is in general
different from the right-hand side of (5.5) (or the right-hand side of the I-MMSE
relationship in (3.7)) by studying the Taylor expansion of $\nabla_{\mathbf{M}} h_2(\mathbf{y})$. For the
high-noise power scenario, the first terms in the expressions coincide but higher order terms
do not ESI. 
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6. Application to Compressive Sensing. 

6.1. Problem setup. We consider CS in the context of imaging. While the 
theory is applicable to complex data, the following examples focus on real images. 
Specifically, consider measurement of the image X £ H N x xN y. for large N x and N y . 
As indicated in Figure [uTTl the image is partitioned into n x x n y contiguous "patches," 
with the pixels in the jth patch denoted by vector Xj £ R £ , with £ = n x n y . In the 
examples considered here n x = n y — 8 (consistent, for example, with the patch sizes 
used in the JPEG standard). 
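To make the patch decomposition concrete, the sketch below splits an image into non-overlapping $8 \times 8$ patches and vectorizes each one, and also reassembles the image from the patches. The helper names `image_to_patches` and `patches_to_image`, and the row-major ordering of patches and pixels, are assumptions made for illustration only.

```python
import numpy as np

def image_to_patches(image, patch=8):
    """Split an image into non-overlapping patch x patch blocks, one row per block."""
    n_rows, n_cols = image.shape
    if n_rows % patch or n_cols % patch:
        raise ValueError("image dimensions must be multiples of the patch size")
    blocks = image.reshape(n_rows // patch, patch, n_cols // patch, patch)
    blocks = blocks.swapaxes(1, 2)          # (block row, block col, patch, patch)
    return blocks.reshape(-1, patch * patch)

def patches_to_image(patches, shape, patch=8):
    """Inverse of image_to_patches: reassemble the vectorized blocks into an image."""
    n_rows, n_cols = shape
    blocks = patches.reshape(n_rows // patch, n_cols // patch, patch, patch)
    return blocks.swapaxes(1, 2).reshape(n_rows, n_cols)
```

For a $256 \times 256$ image this yields 1024 vectors of length $\ell = 64$, each of which can be sensed and recovered independently.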

It is desirable to partition the images into such patches because one may readily
learn a signal model for the $\{x_j\}$, while it is difficult to learn an accurate signal
model directly on the entire image $X$. Specifically, following [11], [14], we assume that
each $x_j$ is drawn from a GMM of the form (5.6), here for real normal distributions.

To learn the prior signal model $p(x)$ for the patches, we first consider a large
ensemble of natural images, from which patches $x_j \in \mathbb{R}^{\ell}$ are selected at
random. Using these training data, a (real) GMM of the form in (5.6) is constituted as a
signal model. To learn this GMM, we have employed nonparametric Bayesian methods as
in [11], as well as expectation-maximization (EM) methods [14], and both methods
yield very similar results. The following results are based on an $N = 20$ component
GMM, trained on 100,000 patches, extracted at random from 500 natural images
in the Berkeley Segmentation Dataset
(http://www.eecs.berkeley.edu/Research/Projects/CS/vision/grouping/resources.html).
These training images are distinct from those considered in the testing phase, for CS
inversion.

While patches are selected at random from training images to constitute the
prior $p(x)$, when performing CS the goal is to recover the entire underlying image
$X$. Therefore, for CS inversion we wish to recover each of the $\{x_j\}$ in Figure 6.1. In
general, a separate projection matrix $M_j$ is applied to patch $j$ from image $X$. For the
case of offline design of the projection matrix, $M_j$ is the same for all patches $j$ (since
it is non-adaptive). For online design a distinct $M_j$ is adaptively designed for each
testing patch $j$. The measured data associated with patch $j$ is expressed as

\[
y_j = M_j x_j + w_j, \qquad j = 1, \ldots, J. \tag{6.1}
\]

In the examples that follow, the images under test are $256 \times 256$, and therefore this
procedure was employed on $J = 1024$ non-overlapping patches of size $8 \times 8$. Each of
the $x_j$ is recovered independently from the respective measured $y_j$, thereby allowing
for massive parallelization.

For simplicity, we henceforth drop the subscript $j$, and the discussion that follows
applies to each of the $J$ patches in Figure 6.1. We assume the noise
$w \sim \mathcal{N}(0, \Sigma_w)$, with known covariance matrix $\Sigma_w$. In the following
examples we consider low-noise, i.i.d. measurements, and therefore
$\Sigma_w = 10^{-6} I_{\ell}$. The likelihood function for the underlying signal $x$ is
$\mathcal{N}(y; Mx, \Sigma_w)$, and the prior $p(x)$ is the aforementioned GMM,
$p(x) = \sum_{i=1}^{N} \pi_i\, \mathcal{N}(x; \mu_i, \Sigma_i)$. Under this likelihood function
for $x$, and with the GMM prior, the posterior $p(x|y)$ is also a GMM:

\[
p(x|y) = \sum_{i=1}^{N} \tilde{w}_i\, \mathcal{N}(x; \tilde{\mu}_i, \tilde{\Sigma}_i), \tag{6.2}
\]

with

\[
\tilde{\Sigma}_i = \Sigma_i - \Sigma_i M^T (M \Sigma_i M^T + \Sigma_w)^{-1} M \Sigma_i, \qquad
\tilde{\mu}_i = \mu_i + \Sigma_i M^T (M \Sigma_i M^T + \Sigma_w)^{-1} (y - M \mu_i), \tag{6.3}
\]

Fig. 6.1. Spatial grid used for CS measurement of an $N_x \times N_y$ image, decomposing the
image into a contiguous grid of "patches," each patch composed of $n_x \times n_y$ pixels,
$n_x \ll N_x$ and $n_y \ll N_y$. Letting $x_i \in \mathbb{R}^{n_x n_y}$ represent the pixels
associated with the $i$th patch, separate projection matrices $M_i$ are designed for each $x_i$.



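\[
\tilde{w}_i = \pi_i\, \mathcal{N}(y; M \mu_i, M \Sigma_i M^T + \Sigma_w) / p(y), \tag{6.4}
\]

\[
p(y) = \sum_{i=1}^{N} \pi_i\, \mathcal{N}(y; M \mu_i, M \Sigma_i M^T + \Sigma_w). \tag{6.5}
\]

When presenting results, the estimated signal $\hat{x}$ is the mean based on $p(x|y)$, i.e.,
$\hat{x} = E[x|y] = \sum_{i=1}^{N} \tilde{w}_i\, \tilde{\mu}_i$.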
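The analytic inversion described here can be sketched as a single routine that returns the posterior mixture weights, the per-component posterior means and covariances, and the posterior-mean estimate. This is a minimal sketch for the real-valued case using the standard Gaussian conditioning identities; the function name `gmm_posterior` and its argument layout are illustrative assumptions, and the weights are normalized in the log domain for numerical stability.

```python
import numpy as np

def gmm_posterior(y, M, weights, means, covs, noise_cov):
    """Posterior GMM of x given y = M x + w, plus the posterior-mean estimate."""
    k = y.size
    log_w, post_means, post_covs = [], [], []
    for pi_i, mu_i, S_i in zip(weights, means, covs):
        C = M @ S_i @ M.T + noise_cov            # covariance of y under component i
        r = y - M @ mu_i                         # innovation
        G = np.linalg.solve(C, M @ S_i).T        # gain S_i M^T C^{-1}
        post_means.append(mu_i + G @ r)
        post_covs.append(S_i - G @ M @ S_i)
        _, logdet = np.linalg.slogdet(C)
        log_lik = -0.5 * (k * np.log(2.0 * np.pi) + logdet + r @ np.linalg.solve(C, r))
        log_w.append(np.log(pi_i) + log_lik)     # unnormalized log posterior weight
    log_w = np.array(log_w)
    w = np.exp(log_w - log_w.max())
    w /= w.sum()                                 # division by p(y), cf. (6.4)-(6.5)
    x_hat = sum(w_i * m_i for w_i, m_i in zip(w, post_means))
    return w, post_means, post_covs, x_hat
```

Because each patch is processed independently, a routine of this kind can be applied to all patches in parallel.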



6.2. Offline and online design. We consider online and offline design of the
projection matrix $M$, based upon gradient descent: $M \leftarrow M + \gamma \nabla_M I(x; y)$,
with re-normalization to satisfy the power constraint. Here the update follows the
gradient of the mutual information; the same type of gradient descent is performed in
the context of Renyi entropy, for which we employ the results of Section 5. When
employing the gradient of mutual information, we employ (3.7). The design of $M$ based
upon a gradient of mutual information is denoted PV, for Palomar and Verdu.

For offline PV design, $p(x)$ corresponds to the learned prior GMM, and the entire
$M$ is inferred at once. For online PV design, after measuring the first $k$ components
of $y$, denoted $y_{1:k}$, we update the posterior $p(x|y_{1:k})$ via (6.2), and row $k + 1$
of $M$ is constituted based upon this posterior signal model; after each measurement, the
posterior is updated, followed by design of the next row of $M$, used to define the next
measurement. In these computations, the MMSE matrix in (3.8) is computed via
Monte Carlo integration, based on draws from $p(x)$ (in the offline case) or
$p(x|y_{1:k})$ (in the online case). Online design of the patch-dependent projection
matrix $M$ may be performed in parallel.
For Renyi-based design we consider the case $\alpha = 2$; this is convenient, as within
the context of the GMM representation employed here the gradient with respect to
$M$ is analytic, via (5.10).
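The gradient update with re-normalization described at the start of this subsection can be sketched as follows. To keep the example self-contained it assumes a single real-valued Gaussian source with white noise, for which the MMSE matrix has the closed form $E = (\Sigma_x^{-1} + M^T M / \sigma^2)^{-1}$ and the gradient is $\sigma^{-2} M E$; for a GMM source the MMSE matrix would instead be estimated by Monte Carlo integration. The function names, the normalized step, and the default step size and iteration count are illustrative assumptions.

```python
import numpy as np

def mutual_information(M, Sx, noise_var):
    """I(x; y) in nats for real Gaussian x ~ N(0, Sx) and y = M x + w, w ~ N(0, noise_var I)."""
    k = M.shape[0]
    return 0.5 * np.linalg.slogdet(np.eye(k) + M @ Sx @ M.T / noise_var)[1]

def mmse_matrix(M, Sx, noise_var):
    """Closed-form MMSE matrix for the Gaussian source (assumption of this sketch)."""
    return np.linalg.inv(np.linalg.inv(Sx) + M.T @ M / noise_var)

def design_projection(Sx, k, noise_var, step=0.05, iters=500, seed=0):
    """Gradient update M <- M + step * grad, re-normalized so that tr(M M^T) = k."""
    rng = np.random.default_rng(seed)
    M = rng.standard_normal((k, Sx.shape[0]))
    M *= np.sqrt(k) / np.linalg.norm(M)             # enforce the power constraint
    M_init = M.copy()
    for _ in range(iters):
        grad = M @ mmse_matrix(M, Sx, noise_var) / noise_var
        M = M + step * grad / np.linalg.norm(grad)  # normalized gradient step
        M *= np.sqrt(k) / np.linalg.norm(M)         # re-normalize after every step
    return M_init, M
```

At a stationary point the gradient is parallel to $M$, so the step followed by re-normalization leaves $M$ unchanged, which mirrors the KKT condition used in the proofs.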

The online PV design is relatively expensive, as one must repeatedly perform
Monte Carlo integration to update the MMSE matrix $E$, and one must also perform
gradient descent. For online Renyi-based design we employ (5.10); while this analytic
expression removes the need to compute $E$ numerically, the large number of sums
makes online Renyi and online PV design comparably expensive.

The relative expense of Renyi and PV online design motivates a simplified online
design. In [14] the authors proposed the PDS method, in which a GMM was used for
$p(x)$. In [14], the components of the first $k < \ell$ rows of $M$ are drawn i.i.d. from a
zero-mean normal distribution. Using this $k$-row sensing matrix, an initial measurement
$y_{1:k} \in \mathbb{R}^{k}$ is performed. Based upon $y_{1:k}$, the most probable mixture
component from the prior $p(x)$ is selected. At this point a single-Gaussian signal model
is constituted. The remaining $\ell - k$ rows of $M$ are then defined by the principal
$\ell - k$ eigenvectors of the covariance matrix from this Gaussian. While [14] did not
have access to our Theorem 1, the design so constituted is consistent with it.
Specifically, Theorem 1 applies to the case of a single-Gaussian signal model. Under the
aforementioned assumptions for $\Sigma_w$ (diagonal covariance matrix, with small diagonal
variance), Theorem 1 implies that the optimal projection matrix corresponds to the
principal eigenvectors of the covariance matrix. However, the assumption of $k$ initial
random projections employed in [14], before selecting a single Gaussian component, seems
undesirable. Further, in [14] the single Gaussian was selected from the prior $p(x)$
rather than from the updated posterior $p(x|y_{1:k})$.

We extend the PDS technique to an online setting as follows. We first initialize
$p(x)$ with the GMM prior signal model (learned using offline training data). We then
sequentially constitute one row of $M$ at a time, for $k = 1, \ldots, \ell$; after each row
is so constituted, a single new projection measurement is performed with that new row.
Again let $y_{1:k}$ represent the vector of data constituted in this manner via the first
$k$ rows of $M$. Based upon these data we update the signal model $p(x|y_{1:k})$. To design
row $k + 1$ of $M$, let $i^* = \arg\max_i \tilde{w}_i$, where the $\{\tilde{w}_i\}$ are the GMM
mixture weights from $p(x|y_{1:k})$. Then the $(k + 1)$th row of $M$ is defined by the
leading eigenvector of $\tilde{\Sigma}_{i^*}$. The online PDS approximates the posterior GMM
at each step with the dominant Gaussian from the posterior GMM $p(x|y_{1:k})$, and then
via Theorem 1 the next row of $M$ is defined by the leading eigenvector of the associated
covariance matrix. Since neither Monte Carlo simulation nor gradient descent is needed
in the above process, the online PDS method is very fast. The eigenvectors are
orthonormal, and therefore the power constraint is satisfied automatically at every
step. Note that the posterior $p(x|y_{1:k})$ continuously updates with increasing data,
and therefore it is not particularly sensitive to the prior $p(x)$; the original PDS in
[14] was based upon the prior $p(x)$ only, which may necessitate more care in selection
of the training data. Since the posterior can be updated easily via (6.2), it appears
highly preferable to use this approach rather than fixing the signal model.
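The online PDS procedure just described can be sketched in a few lines. The following is a minimal sketch for the real-valued case with white noise; `posterior` and `online_pds` are illustrative names, the measurement is simulated from a known `x_true`, and the posterior is recomputed from the prior using all rows gathered so far (which is equivalent to updating it sequentially).

```python
import numpy as np

def posterior(y, M, weights, means, covs, noise_var):
    """Posterior GMM weights, means and covariances given y = M x + w (real case)."""
    log_w, mus, Ss = [], [], []
    for pi_i, mu, S in zip(weights, means, covs):
        C = M @ S @ M.T + noise_var * np.eye(y.size)
        r = y - M @ mu
        G = np.linalg.solve(C, M @ S).T           # gain S M^T C^{-1}
        mus.append(mu + G @ r)
        Ss.append(S - G @ M @ S)
        # constants common to all components are dropped before normalization
        log_w.append(np.log(pi_i) - 0.5 * (np.linalg.slogdet(C)[1] + r @ np.linalg.solve(C, r)))
    log_w = np.array(log_w)
    w = np.exp(log_w - log_w.max())
    return w / w.sum(), mus, Ss

def online_pds(x_true, weights, means, covs, n_meas, noise_var, rng):
    """Design one row of M at a time from the dominant posterior component."""
    dim = x_true.size
    M, y = np.zeros((0, dim)), np.zeros(0)
    w, mus, Ss = np.asarray(weights, dtype=float), list(means), list(covs)
    for _ in range(n_meas):
        i_star = int(np.argmax(w))                # most probable component
        S = 0.5 * (Ss[i_star] + Ss[i_star].T)     # symmetrize against round-off
        _, eigvecs = np.linalg.eigh(S)
        row = eigvecs[:, -1]                      # leading eigenvector, unit norm
        meas = row @ x_true + np.sqrt(noise_var) * rng.standard_normal()
        M, y = np.vstack([M, row]), np.append(y, meas)
        w, mus, Ss = posterior(y, M, weights, means, covs, noise_var)
    x_hat = sum(w_i * mu_i for w_i, mu_i in zip(w, mus))
    return M, x_hat
```

Each new row has unit norm by construction, and in the low-noise regime it is nearly orthogonal to the rows already measured, because the posterior covariance has almost no remaining variance along those directions.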

6.3. Experimental Results. In Figures 6.2-6.4 results are shown for three
widely examined test images: 'barbara', 'house' and 'pepper,' respectively. Two
classes of results are considered based upon random projection design. The "random
GMM" results employ the patch-based CS construction in Figure 6.1 and the
learned GMM-based prior $p(x)$. The form of these results is the same as employed
for the designed $M_j$, except here each $M_j$ is constituted with matrix elements drawn
i.i.d. from $\mathcal{N}(0, 1)$, followed by normalization. We also considered CS design in
which the projections are performed directly on the entire image $X$, rather than at the
patch level as in Figure 6.1. If one performs CS inversion based on traditional CS
algorithms, which employ $\ell_1$ and related regularization [8], the quality of the
inversion is markedly worse than that using the proposed approach, with learned signal
models $p(x)$; we therefore do not show these results here, because they do not fit on
the same scale as the results presented. This is not surprising, as the patch-dependent
learned signal model $p(x)$ is much richer, and better tailored to the data, than the
simple sparsity constraints that motivate $\ell_1$ regularization. To provide a fairer
comparison, when performing inversion for the case in which the projections are
performed directly on the entire image $X$, we consider an underlying wavelet basis and
perform inversion based on the sophisticated hidden Markov tree (HMT) wavelet model for
images [20]. This signal model $p(x)$ could in principle also be used within the theory
to design a projection matrix applicable to the entire image. However, the significant
advantage of the GMM construction is that the posterior of the underlying signal may be
constituted analytically, while for the HMT expensive computational methods are needed
[20]. Therefore, we only show HMT inversion results when the projection matrix is
constituted at random, thereby providing a comparison of inversion quality of the
GMM (patch based) and the HMT (entire image), based upon random projections.

Fig. 6.2. PSNR for the reconstructed 'barbara' image.

Fig. 6.3. PSNR for the reconstructed 'house' image.

Fig. 6.4. PSNR for the reconstructed 'pepper' image.

We consider offline design of the patch-based projection matrix $M$ based upon the
Renyi measure of entropy, as well as based upon mutual information (via the PV
theory). For online Renyi and PV design, we do not make a simplifying single-Gaussian
assumption when designing each row of $M$. By contrast, the online PDS method uses
the most probable Gaussian from the posterior to design the next projection at each
step (this is therefore an approximation). The PDS method is very fast, while online
PV is expensive and is therefore shown principally for comparison (it may not be
feasible in practice, where online design must be fast).

First comparing the results based on random projections, the results based upon
the (learned) patch-based GMM and based on the entire-image-based HMT are
comparable in reconstruction quality. Sometimes the GMM results are slightly better,
and other times the HMT results are better. However, the two differ greatly in
computation speed. The HMT results are expensive, being based upon a
Gibbs sampler [20]. By contrast the GMM results are very fast, with the inversion
analytic. The additional big advantage of the GMM representation is that it allows
convenient design of patch-dependent projection matrices, which we consider next.

Each of the designed projection methods yields significant improvement relative
to random, and after approximately 6 projections per patch we note that the online
results are significantly better than offline design. For the first approximately 5
measurements per patch, the offline and online results are comparable; we attribute
this to an inadequate number of measurements to obtain an accurate signal model,
and therefore little gain manifested by adaptivity. However, after approximately 6
measurements per patch it appears that the posterior signal model becomes accurate,
yielding advantages of adaptivity. Concerning online design, inversion quality based
on the simple and fast online PDS is quite competitive relative to the online
Renyi and PV design (which do not make a simplification to a single Gaussian),
despite the fact that it assumes that the patch is drawn from a single Gaussian.

To understand the quality of the simple PDS-based design, consider Figure 6.5,
wherein we plot the probabilities $\{\tilde{w}_i\}_{i=1,\ldots,N}$ for the posterior
$p(x|y_{1:k})$, as the number of measurements $k$ increases from 1 to $\ell$. Note that
after approximately six measurements the model has inferred that the underlying signal
$x$ was drawn from a single multivariate Gaussian. Note that the GMM is characteristic
of an ensemble of draws, like those characteristic of the multiple patches in Figure
6.1. However, any single patch is drawn from a single one of the mixture components; it
is however unknown a priori which component. Based upon experiments of this type,
typically 6 projections are sufficient to infer which single mixture component a given
patch corresponds to. At this point the results in Theorem 1 apply directly, which under
the assumption for $\Sigma_w$ dictate that the optimal measurement corresponds to
projecting onto the dominant eigenvector of the covariance matrix of the single mixture
component (single Gaussian); this is precisely what PDS does.




Fig. 6.5. Evolution of the mixture weight in the posterior GMM for a typical testing patch in
'barbara' (horizontal axis: number of projections, from 2 to 20).

7. Conclusions. We observe that the design principle of maximizing mutual
information or Renyi entropy leads to deterministic kernel matrices for which MMSE
performance is superior to that of random kernel matrices. In particular, we are able
to provide design principles for the optimal kernel matrix for a general multivariate
source that maximizes the mutual information or Renyi entropy (for which Shannon
entropy is a special case). We showed that the optimal kernel exposes the modes of
the noise and the modes of the (optimal) MMSE matrix, then performs an alignment
operation whose purpose is to optimally match the modes of the noise to the modes
of the MMSE matrix (or, in the multivariate Gaussian source scenario, the modes of
the source covariance). Finally, it carries out a generalized mercury-waterfilling power
allocation operation.
The theoretical framework has been demonstrated with application to compressive
sensing (CS) as applied to imagery. Using a GMM signal model, it was demonstrated
that designed measurement kernels can yield markedly improved CS signal recovery
relative to random design. The GMM representation has the advantage of yielding
closed-form CS inversion, which is particularly attractive for fast signal inversion and
for online kernel design. We have enhanced an online kernel design framework first
proposed in [14], and have also provided a theoretical foundation for why it works so
effectively in practice.
Appendix A. Complex Derivatives and Gradients.

Throughout the paper we adopt the definition of the formal partial complex
derivative of a real-valued scalar function $f$ with respect to a complex-valued variable
$x$ given by [32]:

\[
\frac{\partial f}{\partial x^{*}} \triangleq \frac{1}{2} \left[ \frac{\partial f}{\partial \mathrm{Re}(x)} + j\, \frac{\partial f}{\partial \mathrm{Im}(x)} \right]. \tag{A.1}
\]

The definition of the complex gradient of a real-valued function $f$ with respect to
a complex-valued matrix $X$ is given by:

\[
\nabla_X f \triangleq \frac{\partial f}{\partial X^{*}}, \tag{A.2}
\]

where $[\nabla_X f]_{ij} = \partial f / \partial [X^{*}]_{ij}$.

Appendix B. Helpful Lemmas.

In the proofs of the theorems stated in this paper we will find the following
lemmas helpful.

Lemma B.1 (Sylvester's Determinant Theorem). We have a "cyclic" property of
determinants for two matrices $A \in \mathbb{C}^{n \times m}$ and $B \in \mathbb{C}^{m \times n}$:

\[
\det(I_n + AB) = \det(I_m + BA). \tag{B.1}
\]



In the following four lemmas we denote two Hermitian matrices by
$\Sigma_A, \Sigma_B \in \mathbb{C}^{m \times m}$, which have eigenvalues
$\lambda_1 \geq \cdots \geq \lambda_m$ and $\mu_1 \geq \cdots \geq \mu_m$, respectively.
The eigenvalue decompositions of these two matrices are
$\Sigma_A = U_A \Lambda_A U_A^{\dagger}$ and $\Sigma_B = U_B \Lambda_B U_B^{\dagger}$, where
$U_A, U_B$ are $m \times m$ unitary matrices,
$\Lambda_A = \mathrm{Diag}(\lambda_1, \ldots, \lambda_m)$ and
$\Lambda_B = \mathrm{Diag}(\mu_1, \ldots, \mu_m)$.

Lemma B.2 (Theorem 1.3.12 in [23]). The matrices $\Sigma_A$ and $\Sigma_B$ commute if
and only if they are simultaneously diagonalizable, i.e., both $U \Sigma_A U^{\dagger}$ and
$U \Sigma_B U^{\dagger}$ are diagonal matrices for some unitary matrix $U$.

Lemma B.3 (Richter [30]).

\[
\sum_{i=1}^{m} \lambda_i\, \mu_{m+1-i} \leq \mathrm{tr}(\Sigma_A \Sigma_B) \leq \sum_{i=1}^{m} \lambda_i\, \mu_i. \tag{B.2}
\]

Remark 2. Sufficient conditions for achieving the upper and lower bounds are
$U_A = U_B$ and $U_A = U_B J_m$, respectively. The sufficient condition to achieve the
lower bound was given by Kose and Wesel in Theorem 2 in [26], and Theobald also
gave necessary and sufficient conditions for achieving the upper bound, which allow
for the multiplicity of eigenvalues.
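The trace bounds in (B.2) are easy to check numerically. The sketch below draws random Hermitian matrices and compares $\mathrm{tr}(\Sigma_A \Sigma_B)$ with the two eigenvalue sums; the helper names `random_hermitian` and `trace_bounds` are illustrative assumptions.

```python
import numpy as np

def random_hermitian(m, rng):
    """Random m x m Hermitian matrix."""
    Z = rng.standard_normal((m, m)) + 1j * rng.standard_normal((m, m))
    return (Z + Z.conj().T) / 2.0

def trace_bounds(SA, SB):
    """Return (lower bound, tr(SA SB), upper bound) as in (B.2)."""
    lam = np.sort(np.linalg.eigvalsh(SA))[::-1]   # eigenvalues in descending order
    mu = np.sort(np.linalg.eigvalsh(SB))[::-1]
    lower = float(np.sum(lam * mu[::-1]))         # largest paired with smallest
    upper = float(np.sum(lam * mu))               # eigenvalues aligned in the same order
    value = float(np.trace(SA @ SB).real)
    return lower, value, upper
```

Setting $\Sigma_B = \Sigma_A$ gives $U_A = U_B$, in which case the trace equals the upper bound, consistent with Remark 2.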

Lemma B.4 (Lemma 3 in Witzenhausen [48]).

\[
\det(I_m + \Sigma_A \Sigma_B) \leq \prod_{i=1}^{m} (1 + \lambda_i \mu_i). \tag{B.3}
\]

Remark 3. A sufficient condition for achieving the upper bound is $U_A = U_B$.
Witzenhausen gave further conditions which allow for the multiplicity of
eigenvalues, stating that if equality holds then $\Sigma_A$ and $\Sigma_B$ commute and the
diagonalizing matrix is such that the eigenvalues are aligned in the same order.

Lemma B.5. Let $P \in \mathbb{C}^{n \times m}$ denote a rectangular matrix,
$\Sigma_H \in \mathbb{C}^{n \times n}$ denote a positive semi-definite matrix, and
$P^{\dagger} \Sigma_H P \in \mathbb{C}^{m \times m}$ be a diagonal matrix with diagonal
elements in decreasing order (possibly with some zero diagonal elements). Then, there
is a matrix of the form $\tilde{P} = V_H [\Lambda, 0]$ that satisfies:

\[
\tilde{P}^{\dagger} \Sigma_H \tilde{P} = \alpha\, P^{\dagger} \Sigma_H P, \tag{B.4}
\]

\[
\mathrm{tr}(\tilde{P} \tilde{P}^{\dagger}) = \mathrm{tr}(P P^{\dagger}), \tag{B.5}
\]

where $\alpha \geq 1$, $V_H$ is a unitary matrix with columns equal to the eigenvectors of
matrix $\Sigma_H$ corresponding to the $\min(n, m)$ largest eigenvalues in decreasing order
and $\Lambda$ is a square diagonal matrix of size $\min(n, m)$.

Proof. This is a modification of Lemma 3.16 in [31]. □

Lemma B.6. For the complex gradient defined in (A.2) and general matrices
$A \in \mathbb{C}^{m \times m}$, $B \in \mathbb{C}^{n \times n}$ and $X \in \mathbb{C}^{m \times n}$,
we have:

\[
\nabla_X \mathrm{tr}(A X B X^{\dagger}) = A X B. \tag{B.6}
\]

Proof. Using properties of differentials (32) and (33) from [37] we have:

\[
\partial\, \mathrm{tr}(A X B X^{\dagger}) = \mathrm{tr}[A\, \partial(X)\, B X^{\dagger}] + \mathrm{tr}[A X B\, \partial(X^{\dagger})]. \tag{B.7}
\]

Together with the results for complex derivatives (219), (220), (221) and (222)
from [37] we have:

\[
\frac{\partial}{\partial \mathrm{Re}(X)} \mathrm{tr}(A X B X^{\dagger}) = A^{T} X^{*} B^{T} + A X B, \tag{B.8}
\]

\[
j\, \frac{\partial}{\partial \mathrm{Im}(X)} \mathrm{tr}(A X B X^{\dagger}) = -A^{T} X^{*} B^{T} + A X B, \tag{B.9}
\]

and the result follows. □

Remark 4. This is the counterpart for complex-valued matrices to result (108)
in [37] for real-valued matrices; note that the term $A^{T} X B^{T}$ is absent in the complex
case.
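Lemma B.6 can be checked by finite differences using the definition (A.1) directly. The sketch below assumes Hermitian $A$ and $B$ so that the objective is real-valued, and forms $\tfrac{1}{2}(\partial f / \partial \mathrm{Re}(X) + j\, \partial f / \partial \mathrm{Im}(X))$ entry by entry; the function names and the step size `h` are illustrative assumptions.

```python
import numpy as np

def objective(A, X, B):
    """f(X) = tr(A X B X^H); real-valued when A and B are Hermitian."""
    return np.trace(A @ X @ B @ X.conj().T).real

def numerical_complex_gradient(func, X, h=1e-6):
    """Entrywise 0.5 * (df/dRe(X) + 1j * df/dIm(X)) via central differences, cf. (A.1)."""
    G = np.zeros(X.shape, dtype=complex)
    for p in range(X.shape[0]):
        for q in range(X.shape[1]):
            E = np.zeros(X.shape, dtype=complex)
            E[p, q] = 1.0
            d_re = (func(X + h * E) - func(X - h * E)) / (2.0 * h)
            d_im = (func(X + 1j * h * E) - func(X - 1j * h * E)) / (2.0 * h)
            G[p, q] = 0.5 * (d_re + 1j * d_im)
    return G

def hermitian(m, rng):
    """Random m x m Hermitian matrix used to build a test case."""
    Z = rng.standard_normal((m, m)) + 1j * rng.standard_normal((m, m))
    return (Z + Z.conj().T) / 2.0
```

Since the objective is quadratic in the real and imaginary parts of $X$, the central differences are exact up to round-off, and the numerical gradient should agree with $A X B$ with no additional factor of 2.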

Lemma B.7. For the complex gradient defined in (A.2) and general matrices
$A \in \mathbb{C}^{m \times m}$, $B \in \mathbb{C}^{n \times n}$, $C \in \mathbb{C}^{n \times n}$ and
$X \in \mathbb{C}^{m \times n}$, we have:

\[
\nabla_X \mathrm{tr}\left[ (A + X B X^{\dagger})^{-1} (X C X^{\dagger}) \right]
= (A + X B X^{\dagger})^{-1} X C \left[ I - X^{\dagger} (A + X B X^{\dagger})^{-1} X B \right]. \tag{B.10}
\]

Proof. Using properties of differentials (32), (33) and (36) in [37] and the
abbreviation $Y = A + X B X^{\dagger}$, we have:

\[
\partial\, \mathrm{tr}\left[ Y^{-1} (X C X^{\dagger}) \right] = \mathrm{tr}\left\{ Y^{-1}\, \partial(X C X^{\dagger}) \right\}
+ \mathrm{tr}\left\{ -Y^{-1}\, \partial(A + X B X^{\dagger})\, Y^{-1} (X C X^{\dagger}) \right\}.
\]

Applying Lemma B.6 the result follows. □

Remark 5. This is the counterpart for complex-valued matrices to result (116)
in [37] for real-valued matrices; note that we do not require the assumption that $B$
and $C$ are Hermitian (symmetric) and there is no factor of 2.

Appendix C. Proof of Theorem 4.1.

Proof. We first provided an alternative derivation of this proof in [10]. The
current proof is derived directly from the proof for an identical theorem for radar
in [46]. We restate the mutual information between the input and output of the
compressive sensing model in (3.1) for a multivariate Gaussian source as follows:

\[
I(x; y) = \log\det\left( I_m + M^{\dagger} \Sigma_w^{-1} M \Sigma_x \right). \tag{4.2}
\]

Note that for a unitary matrix $U$, the kernel $P = MU$ has the same power as $M$,
i.e., $\mathrm{tr}(P P^{\dagger}) = \mathrm{tr}(M M^{\dagger})$, but it may have different mutual
information. In particular, a choice of $U$ that maximizes the mutual information for a
given $M$ is $U = U_x$. This can be seen from Lemma B.4 and the remark that follows it.
From Lemma B.5 we know that there exists a matrix $\tilde{P} = U_w [\Lambda, 0]$, which
satisfies $\mathrm{tr}(\tilde{P} \tilde{P}^{\dagger}) = \mathrm{tr}(P P^{\dagger})$ and
$\tilde{P}^{\dagger} \Sigma_w^{-1} \tilde{P} = \alpha\, P^{\dagger} \Sigma_w^{-1} P$ where
$\alpha \geq 1$. Since the function $\det(I + \alpha A)$ is monotonically
increasing in $\alpha$ for a positive semi-definite matrix $A$, the optimal kernel matrix
must have the form $M^{*} = U_M \Lambda_M V_M^{\dagger} = U_w [\Lambda, 0]\, U_x^{\dagger}$.

Finally, we determine the optimal singular values by optimizing the mutual
information with respect to the eigenvalues rather than the singular values, since 1)
they map one-to-one (up to a factor of $\exp(j\theta)$, which does not affect the mutual
information) and 2) this new optimization problem is convex, so the Karush-Kuhn-Tucker
(KKT) optimality conditions [5] define the unique global optimum. This is given by:

\[
\lambda_{M_i} =
\begin{cases}
0, & \dfrac{1}{\eta} - \dfrac{\lambda_{w_i}}{\lambda_{x_i}} < 0, \\[2mm]
\dfrac{1}{\eta} - \dfrac{\lambda_{w_i}}{\lambda_{x_i}}, & \text{otherwise},
\end{cases} \tag{C.1}
\]

where $\eta$ is such that the average unit norm row constraint is satisfied, i.e.,
$\frac{1}{\ell} \sum_{i} \lambda_{M_i} = 1$, where the eigenvalues of $\Sigma_x$ are arranged in
descending order and the eigenvalues of $\Sigma_w$ are arranged in ascending order, i.e.,
$\lambda_{x_1} \geq \cdots \geq \lambda_{x_m}$ and
$0 < \lambda_{w_1} \leq \cdots \leq \lambda_{w_\ell}$. □
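The power allocation at the end of this proof has the familiar waterfilling structure: each active mode receives $1/\eta - \lambda_{w_i}/\lambda_{x_i}$, and $\eta$ is fixed by the power constraint. The sketch below finds the "water level" $1/\eta$ by bisection; the function name `waterfill`, the generic `total_power` argument (equal to $\ell$ under the average unit norm row constraint), and the bisection iteration count are illustrative assumptions.

```python
import numpy as np

def waterfill(lam_x, lam_w, total_power, iters=200):
    """Allocate lam_M[i] = max(level - lam_w[i] / lam_x[i], 0) with sum(lam_M) = total_power.

    lam_x is sorted in descending order and lam_w in ascending order, so the
    i-th entries are the aligned source and noise modes.
    """
    floor = np.asarray(lam_w, dtype=float) / np.asarray(lam_x, dtype=float)
    lo, hi = floor.min(), floor.min() + total_power   # bracket for the level 1/eta
    for _ in range(iters):
        level = 0.5 * (lo + hi)
        if np.maximum(level - floor, 0.0).sum() > total_power:
            hi = level
        else:
            lo = level
    level = 0.5 * (lo + hi)
    return np.maximum(level - floor, 0.0)
```

Modes whose ratio $\lambda_{w_i}/\lambda_{x_i}$ lies above the water level receive no power, which corresponds to the first case of the allocation rule.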

Appendix D. Proof of Theorem 4.2.

Proof. The proof draws on the work by Payaro and Palomar [35], which described
the generalized mercury waterfilling aspect of the solution, but not the mode
alignment aspect, and the work by Lamarca [27], which described the mode alignment
aspect but did not focus on the generalized mercury waterfilling interpretation. The
current proof highlights both the mode alignment and mercury waterfilling aspects of
the solution.
The solution to the optimization problem in (4.1) satisfies the KKT optimality
conditions:

\[
\nabla_M \left[ -I(x; y) - \eta \left( \ell - \mathrm{tr}(M M^{\dagger}) \right) \right] \Big|_{M = M^{*}} = 0, \tag{D.1}
\]

\[
\eta \cdot \left[ \ell - \mathrm{tr}(M^{*} M^{*\dagger}) \right] = 0, \tag{D.2}
\]

with $\eta > 0$. Using the relationship between the gradient of the mutual information
and the MMSE matrix in (3.7), the optimal kernel satisfies:

\[
\eta\, M^{*} M^{*\dagger} = \Sigma_w^{-1} \left( M^{*} E^{*} M^{*\dagger} \right). \tag{D.3}
\]



We note that (D.3) is diagonalized by $U_M$, by definition, from which it can be seen
that the matrices $\Sigma_w^{-1}$ and $M^{*} E^{*} M^{*\dagger}$ commute. From this observation,
together with the fact that (D.3) is Hermitian and Lemma B.2, we deduce that:

\[
U_M = U_w \Pi_U \Lambda_U, \tag{D.4}
\]

where $\Lambda_U$ is a diagonal matrix with unit modulus diagonal elements and $\Pi_U$ is a
permutation matrix. Furthermore, $U_M^{\dagger} M^{*} E^{*} M^{*\dagger} U_M$ is a diagonal
matrix, from which we can infer:

\[
V_M = U_E \Pi_V \Lambda_V, \tag{D.5}
\]

where $\Lambda_V$ is a diagonal matrix with unit modulus diagonal elements and $\Pi_V$ is a
permutation matrix. Both mutual information and the MMSE matrix are independent
of $\Lambda_U$ and $\Lambda_V$, allowing us to write without loss of generality the optimal
unitary matrices as follows:

\[
U_M = U_w, \tag{D.6}
\]

\[
V_M = U_E \Pi^{*}, \tag{D.7}
\]

where $\Pi^{*}$ is some optimal permutation matrix.

By setting $U_M = U_w$ we can now obtain an equivalent channel model$^{8}$:

\[
y' = \Lambda_w^{-1/2} \Lambda_M V_M^{\dagger} x + n, \tag{D.8}
\]

where $y' = \Lambda_w^{-1/2} U_w^{\dagger} y$ and $n = \Lambda_w^{-1/2} U_w^{\dagger} w$ is
zero-mean circularly symmetric complex Gaussian noise with identity covariance,
$n \sim \mathcal{CN}(n; 0, I)$.

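It was shown in [49] that for a fixed value of $V_M$ the mutual information $I(x; y')$
is concave with respect to the squared singular values of $\Lambda_M$, i.e., the following
optimization problem has a unique global optimum given by the KKT conditions, where
$\Lambda_Q = \Lambda_M \Lambda_M^{\dagger}$:

\[
\begin{aligned}
\underset{\Lambda_Q}{\text{maximize}} \quad & I(x; y') \\
\text{subject to} \quad & \sum_{i=1}^{\ell} \lambda_{M_i} \leq \ell, \\
& \lambda_{M_i} \geq 0.
\end{aligned} \tag{D.9}
\]

8 The equivalence is in the sense that the mutual information between the input and the output
of both models is equal, i.e., $I(x; y) = I(x; y')$.
The Lagrangian for this optimization problem is:

\[
\mathcal{L}(\Lambda_Q) = I(x; y') + \eta \left( \ell - \sum_{i=1}^{\ell} \lambda_{M_i} \right) + \sum_{i=1}^{\ell} \eta_i \lambda_{M_i}, \tag{D.10}
\]

and the Karush-Kuhn-Tucker conditions state that:

\[
\frac{\partial}{\partial \Lambda_Q} \mathcal{L}(\Lambda_Q) \Big|_{\Lambda_Q = \Lambda_Q^{*}} = 0, \tag{D.11}
\]

\[
\eta \left( \ell - \sum_{i=1}^{\ell} \lambda_{M_i}^{*} \right) = 0, \tag{D.12}
\]

\[
\eta_i\, \lambda_{M_i}^{*} = 0, \qquad i = 1, \ldots, \ell. \tag{D.13}
\]

By using the result from [35] that states that:

\[
\frac{\partial}{\partial \Lambda_Q} I(x; y') = \mathrm{Diag}\left( E_{x'} \Lambda_w^{-1} \right), \tag{D.14}
\]

where $E_{x'} = V_M^{\dagger} E V_M$ is the MMSE matrix associated with the estimation of
$x' = V_M^{\dagger} x$ from $y'$, it is possible to rewrite (D.11) as follows:

\[
\eta \lambda_{w_i} - \mathrm{mmse}_i(V_M, \Lambda_Q) = \eta_i \lambda_{w_i}, \qquad i = 1, \ldots, \ell, \tag{D.15}
\]

where $\mathrm{mmse}_i(V_M, \Lambda_Q)$ denotes the $i$-th diagonal entry of $E_{x'}$ for that
particular $\Lambda_Q$.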

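From the KKT conditions, we know that if $\lambda_{M_i}^{*} > 0$ then $\eta_i = 0$ and that
$\eta > 0$. For a given value of $\eta$, the value of $\lambda_{M_i}^{*}$ can be calculated
from the relationship $\eta \lambda_{w_i} = \mathrm{mmse}_i(V_M, \Lambda_Q)$ for fixed
$\lambda_{M_j}, \forall j \neq i$. The function $\mathrm{mmse}_i$ is non-negative
and monotonically decreasing in $\lambda_{M_i} \in [0, \infty]$ for fixed
$\lambda_{M_j}, \forall j \neq i$, and its maximum value is given when $\lambda_{M_i} = 0$.
Therefore if $\eta \lambda_{w_i} > \mathrm{mmse}_i(V_M, \Lambda_Q|_{\lambda_{M_i} = 0})$, where

\[
\Lambda_Q|_{\lambda_{M_i} = 0} = \mathrm{Diag}\left( \lambda_{M_1}, \ldots, \lambda_{M_{i-1}}, 0, \lambda_{M_{i+1}}, \ldots, \lambda_{M_\ell} \right),
\]

then $\lambda_{M_i}^{*} = 0$ and $\eta_i \neq 0$.

This result is true for all values of $V_M$, therefore it is also true when
$V_M = U_E \Pi^{*}$ and the result follows. □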



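Appendix E. Proof of Theorem 5.1.

Proof. Note that:

\[
p(y) = \mathcal{CN}\left( y; M \bar{x}, \Sigma_w + M \Sigma_x M^{\dagger} \right), \tag{E.1}
\]

and so:

\[
p^{\alpha}(y) = \frac{\mathcal{CN}\left( y; M \bar{x}, \frac{1}{\alpha} (\Sigma_w + M \Sigma_x M^{\dagger}) \right)}{\alpha^{\ell} \left[ \pi^{\ell} \det(\Sigma_w + M \Sigma_x M^{\dagger}) \right]^{\alpha - 1}}. \tag{E.2}
\]

By substituting this into the expression for Renyi entropy it follows that:

\[
h_\alpha(y) = \log\left[ \pi^{\ell} \det(\Sigma_w + M \Sigma_x M^{\dagger}) \right] + \frac{\ell \log \alpha}{\alpha - 1}. \tag{E.3}
\]

The result now follows using the definition of Shannon entropy for Gaussian
sources. □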

Appendix F. Proofs for Gradient of Renyi entropy.
Proof. Let us first show that we can express the following relevant gradient
analytically:

\[
\nabla_M \log \mathcal{CN}(0; \mu_{ij}, \Sigma_{ij}) = - \nabla_M \log \pi^{\ell} - \nabla_M \left\{ \log\det \Sigma_{ij} \right\} \tag{F.1}
\]

\[
\qquad\qquad - \nabla_M \left\{ \mu_{ij}^{\dagger} \Sigma_{ij}^{-1} \mu_{ij} \right\}, \tag{F.2}
\]

where $\mu_{ij} = M (\mu_i - \mu_j)$ and $\Sigma_{ij} = M (\Sigma_i + \Sigma_j) M^{\dagger} + 2 \Sigma_w$.

The first term is zero. The second term is the mutual information for a complex
Gaussian distribution and can be evaluated using (3.7), relating the mutual
information and MMSE matrix:

\[
\nabla_M \log\det \Sigma_{ij} = (2 \Sigma_w)^{-1} M E_{ij}, \tag{F.3}
\]

where $E_{ij} = \left[ (\Sigma_i + \Sigma_j)^{-1} + M^{\dagger} (2 \Sigma_w)^{-1} M \right]^{-1}$ is the
MMSE matrix if the input signal $x$ was Gaussian distributed with covariance
$(\Sigma_i + \Sigma_j)$ and distorted by Gaussian noise with covariance $2 \Sigma_w$. It can
also be expressed:

\[
\nabla_M \log\det \Sigma_{ij} = \Sigma_{ij}^{-1} M (\Sigma_i + \Sigma_j), \tag{F.4}
\]

where we can use Woodbury's Inversion Lemma to convert between the two. The
third and final term, using Lemma B.7 and chain rule (94) in [32], can be expressed:

\[
\nabla_M \left\{ \mu_{ij}^{\dagger} \Sigma_{ij}^{-1} \mu_{ij} \right\} = \Sigma_{ij}^{-1} M (\mu_i - \mu_j)(\mu_i - \mu_j)^{\dagger}
\times \left\{ I - M^{\dagger} \Sigma_{ij}^{-1} M (\Sigma_i + \Sigma_j) \right\}. \tag{F.5}
\]

□



